Intrusion detection method and system, related network and computer program product therefor

ABSTRACT

Intrusions in a system under surveillance are detected by matching the events occurring during operation of the system against a knowledge base including information on events which occurred during a learning phase. The detection technique includes the steps of: recording, during the learning phase, temporal data related to the events during the learning phase; identifying, as a function of the temporal data recorded, a dynamic part of the knowledge base; discovering patterns that cover the dynamic part of the knowledge base; and using, during the analysis phase, a regular expression match at least with respect to the dynamic part of the knowledge base.

CROSS REFERENCE TO RELATED APPLICATION

This application is a national phase application based on PCT/EP2004/013424, filed Nov. 26, 2004, the content of which is incorporated herein by reference.

FIELD OF THE INVENTION

The present invention relates to Intrusion Detection Systems (IDSs). The purpose of these systems is to detect security problems in computer systems and networks caused by the action of an external or internal agent that may damage the computer system or the network. The agent can be an automatic system (e.g. a computer virus or a worm) or a human intruder who tries to exploit some weaknesses in the system for a specific purpose (e.g. unauthorized access to reserved data).

DESCRIPTION OF THE RELATED ART

The purpose of a computer Intrusion Detection System (IDS) is to collect and analyze the information on the activity performed on a given computer system in order to detect, as early as possible, evidence of malicious behaviour.

Two fundamental mechanisms/arrangements have been developed so far in the context of Intrusion Detection: Network-Based Intrusion Detection Systems (NIDS) and Host-Based Intrusion Detection Systems (HIDS).

Network-Based Intrusion Detection Systems analyze packets that flow in the system/network under surveillance, searching for anomalous activities; the majority of Network-Based Intrusion Detection Systems employ pattern-based techniques to discover evidence of an attack.

Host-Based Intrusion Detection Systems work on a host-by-host basis, using a broader variety of techniques to achieve their purpose. Host-Based Intrusion Detection Systems are usually better tailored for detecting attacks that can have a real impact on the host under their control.

Network-Based Intrusion Detection Systems have a broader vision of the computer network in comparison with their Host-Based counterparts. As a consequence, Network-Based Intrusion Detection Systems can correlate different attacks more easily and can detect anomalies that may be overlooked if only a single host is taken into account. Specific attacks, such as those that employ ciphered connections or some form of covert channels, are however extremely difficult to discover using Network-Based techniques only.

In conclusion, both mechanisms should notionally be used when deploying a complete Intrusion Detection System.

In order to measure the effectiveness of an Intrusion Detection System, it is possible to evaluate two fundamental figures: the rate of False-Positives and the rate of False-Negatives. False-Positives are those normal events that are erroneously detected as attacks; conversely, False-Negatives are actual attacks that are not correctly identified by the Intrusion Detection System.
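By way of illustration only, the two figures can be computed from a labelled set of events as in the following minimal sketch. The function name `error_rates`, the event representation and the choice of denominators (normal events for False-Positives, attacks for False-Negatives) are assumptions made for this example; the text above defines the two kinds of error but not the exact formula.

```python
def error_rates(events):
    """Return (false_positive_rate, false_negative_rate).

    `events` is a sequence of (is_attack, flagged) boolean pairs:
    `is_attack` is the ground truth, `flagged` is the IDS verdict.
    The denominators used here are an assumption of this sketch.
    """
    events = list(events)
    normal = [flagged for is_attack, flagged in events if not is_attack]
    attacks = [flagged for is_attack, flagged in events if is_attack]

    # Normal events erroneously detected as attacks.
    false_positives = sum(1 for flagged in normal if flagged)
    # Actual attacks that were not detected.
    false_negatives = sum(1 for flagged in attacks if not flagged)

    fp_rate = false_positives / len(normal) if normal else 0.0
    fn_rate = false_negatives / len(attacks) if attacks else 0.0
    return fp_rate, fn_rate
```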

A primary goal of an Intrusion Detection System is thus to minimize these figures, while maintaining an acceptable analysis rate (that is, the number of events that can be analyzed in a time unit). Obviously, different technologies result in different False-Positive and False-Negative rates.

The most commonly used techniques for Intrusion Detection Systems are Misuse Detection (MD) and Anomaly Detection (AD). Occasionally, Artificial Intelligence (AI) and State Analysis (SA) techniques have been used, but only in a few implementations.

Misuse Detection is the technique commonly adopted in Network-Based Intrusion Detection Systems. Usually, some pattern matching method is applied over a series of rules to detect misuse conditions. Exemplary of this approach are the arrangements disclosed e.g. in U.S. Pat. No. 5,278,901 or U.S. Pat. No. 6,487,666.

Pattern-Based systems are well suited for Network-Based Intrusion Detection Systems but are not very efficient in the context of Host-Based Intrusion Detection Systems, where they can generate a high False-Negative rate, because this mechanism usually fails to detect an attack for which no specific signature has been provided.

The so-called Anomaly Detection Paradigm is another approach adopted in Intrusion Detection Systems. Anomaly Detection systems “learn” the normal behaviour of the host or the network they should protect, by collecting all events occurring during a training phase. These data are organized into a structured knowledge base, which is then used in order to detect events that significantly diverge from whatever has been observed previously. Those events are strongly suspicious, and are signaled as indications of a likely incoming intrusion. For a detailed discussion of the application of Anomaly Detection techniques in the field of Intrusion Detection, see D. Wagner and D. Dean “Intrusion Detection via Static Analysis”, IEEE Symposium on Security and Privacy, 2001, pp. 156-169. Anomaly Detection may be used in both Network-based and Host-based Intrusion Detection Systems, although it is rather more common in the latter case.

Anomaly Detection is able to cope with unseen attack patterns, reducing the False-Negative rate. However, it also shows an appreciably higher False-Positive rate, because certain permitted actions have not been included in the policy or have not been observed during the learning stage.

To a certain extent, a kind of duality exists between Misuse Detection and Anomaly Detection: a Misuse Detection system uses a set of signatures to detect attacks, so it can only detect known attacks; on the other hand, an Anomaly Detection system can detect unknown events, but it has difficulty distinguishing between real attacks and unforeseen regular behaviour. Hence, each one has its own advantages and disadvantages.

One of the most important advantages of an Anomaly Detection system lies in that such a system does not need periodic updates to remain effective, as a Misuse Detection system does; after the initial training phase, it is immediately able to detect new attacks as well, for which no signatures are yet available.

The effectiveness of such an approach strongly depends on the quality of the knowledge base built during the learning phase and on the performance of the detection method: a system which is not well trained may generate a huge amount of False-Positives, detecting events that diverge from the knowledge base but are not related to a real incoming intrusion. In many cases, the amount of such useless data makes the whole system unmanageable.

The effectiveness of an Anomaly Detection system thus largely depends on the construction of the knowledge base, and several techniques have been proposed for optimizing the learning process. This process must be substantially automatic; otherwise, using the system effectively would be extremely difficult.

A basic example of the automatic construction of a knowledge base is presented in S. Zanero and S. M. Savaresi: “Unsupervised Learning Techniques for an Intrusion Detection System”, Proceedings of the ACM Symposium on Applied Computing, ACM SAC 2004, 14-17 Mar. 2004.

In this paper, the authors describe a system that collects all packets in the network under surveillance, and builds the knowledge base by classifying application-level packet payloads to create a distribution of the typical network traffic. The system then uses this application-level classifier to detect significant deviations from the usual traffic patterns. The most critical issue in such a system is the need to distinguish temporary network spikes, related to an increase of regular activity, from the occurrence of a real attack. In general, certain mathematical models are used to make such a distinction; however, such models depend on some intrinsic parameters (thresholds) which are usually hard to estimate correctly.

In general, and independently of the specific context, an Anomaly Detection system struggles with the need to keep the training phase reasonably short and the knowledge base small, while at the same time it should be able to identify all the events that are perfectly admissible in the system under analysis.

A quite critical situation arises when these systems have to cope with a highly dynamic environment, where many different resources are used temporarily, oftentimes never to be used again under identical circumstances. For that reason, Anomaly Detection systems are usually best suited for static and slowly varying environments.

Document U.S. Pat. No. 6,742,124 discloses an Anomaly Detection System based on analysis of sequences of system calls, intercepted by a software wrapper. That system operates in real time through the representation of known sequences of system calls in a distance matrix. The distance matrix indirectly specifies known sequences of system calls by specifying allowable separation distances between pairs of system calls. The distance matrix is used to determine whether a sequence of system calls in an event window represents an anomaly.

Document EP-A-0 985 995 describes another advanced application of this technique, which relies on the TEIRESIAS method. This method makes it possible to identify long patterns in system call sequences and, hence, to better define the correlation function that is used in the Anomaly Detection method, when a generic execution of the monitored process is compared with a static set of system call sequences generated during the learning stage.

Both the prior art documents considered in the foregoing disclose arrangements that perform detection by analyzing sequences of system calls issued during the normal process activity; this is indeed one of the most interesting applications of the Anomaly Detection technique in the context of Host-based Intrusion Detection Systems. Generally speaking, a modern Operating System uses at least two different levels of privilege for running applications; at user level, the application behaviour is constrained and the single application cannot arbitrarily manipulate system-wide resources, while at the kernel level, the application has complete control over the system. The transition between user and kernel level is regulated by the system calls, which allow a non-trusted application to manipulate a system-wide resource. For example, using a system call an application can spawn (or terminate) another application, create a file, or establish a network connection.

Several documents discuss Pattern Discovery techniques. Such Pattern Discovery techniques are well known especially in the field of Bio-Informatics; in fact, these methods are widely used in that context to extract protein and DNA/RNA (Deoxyribonucleic Acid/Ribonucleic Acid) patterns from long sequences of genetic material. In general, Pattern Discovery Techniques are used to extract some significant information that repeats itself over long sequences of apparently random data. A major part of this information is not relevant for the process.

The paper from the University of Helsinki entitled “Finding a Good Collection of Patterns Covering a Set of Sequences” by A. Brazma, E. Ukkonen, J. Vilo [http://www.cs.helsinki.fi/TR/C-1995/60/] can be considered a basic reference on Pattern Discovery methods. It shows with several examples how it is possible to extract patterns from a set of sequences and which of the possible coverings should be considered the best one.

Document PCT/EP03/14385 discloses a Host-based Intrusion Detection System based on Anomaly Detection that performs three steps: a learning phase (when the knowledge base is built by analyzing a predefined set of system calls), a normalization phase (the knowledge base is optimized off-line by pruning useless data or translating it into a more compact form so as to improve the effectiveness of the whole system) and an analysis phase (when the anomaly detection is actually performed). Such an arrangement relies i.a. on the concept that it is possible to monitor effectively the behaviour of a given application. This concept is broadly accepted in the Intrusion Detection scientific community, as witnessed by e.g. S. Forrest et al. in “A Sense of Self for Unix Processes”, IEEE Symposium on Security and Privacy, 1996, pp. 120-128. This article discusses a method for anomaly detection based on the short-range correlation of sequences of system calls.

Another example of Anomaly Detection Based IDS is described in document US2002/0138755-A1.

OBJECT AND SUMMARY OF THE INVENTION

Despite the extensive efforts witnessed by the prior art discussed in the foregoing, the need is felt for a truly efficient technique that may minimize the False-Positive rate by further improving the way of building and refining the knowledge base in an Intrusion Detection System based on the Anomaly Detection paradigm, without unduly increasing the size of the knowledge base and the amount of time involved in the training phase.

The object of the invention is thus to provide a fully satisfactory response to this need.

According to the present invention, that object is achieved by means of a method having the features set forth in the claims that follow. The invention also relates to a corresponding Intrusion Detection System, a network including at least one system exposed to intrusions and under surveillance by such an associated Intrusion Detection System as well as a related computer program product, loadable in the memory of at least one computer and including software code portions for performing the steps of the method of the invention when the product is run on a computer. As used herein, reference to such a computer program product is intended to be equivalent to reference to a computer-readable medium containing instructions for controlling a computer system to coordinate the performance of the method of the invention. Reference to “at least one computer” is evidently intended to highlight the possibility for the present invention to be implemented in a distributed/modular fashion.

The claims are an integral part of the disclosure of the invention provided herein.

A preferred embodiment of the invention is an Intrusion Detection System based on the Anomaly Detection Paradigm, wherein intrusions in a system under surveillance are detected by analyzing events occurring during operation of said system under surveillance by matching said events occurring during operation of said system against a knowledge base including information on events which occurred during a learning phase, operation of such a system including the steps of:

recording, during said learning phase, temporal data related to said events which occurred during said learning phase;

identifying a dynamic part of said knowledge base as a function of said temporal data;

discovering patterns that cover said dynamic part of the knowledge base; and

using, during said analysis phase, a regular expression match at least in respect of said dynamic part of the knowledge base.

The exemplary arrangement described herein aims at improving the effectiveness of Intrusion Detection Systems based on Anomaly Detection by applying Pattern Discovery techniques to those data that represent the dynamic section of the monitored system.

In the solution described herein a Pattern Discovery method is used to transform the data into a more compact and significant form; all the data analyzed are regarded as significant for deriving the pattern, while in classical bio-informatics applications there are few interesting sub-sequences in a sea of uninteresting data. Moreover, the pattern discovery mechanism described herein is more concerned with the identification of similarities existing between the strings that represent resource properties that the Intrusion Detection System has to monitor. The main goal of the solution described herein is thus finding a minimal coverage (in terms of patterns) of the set of strings that are analyzed; this allows the Intrusion Detection System to manage also resources that have never been used during the learning stage, but which are perfectly legitimate to use.

The solution described herein aims at minimizing the False-Positive rate, without increasing the size of the knowledge base and the amount of time required for creating such a knowledge base, by recording some extra information, related to the temporal behavior of the events recorded during the learning stage. This data is then processed using a specific pattern extraction technique, which generates a more compact knowledge base and also identifies extremely volatile and dynamic resource usage patterns, which cannot be analyzed using the classical Anomaly Detection approach.

Roughly speaking, the basic concept underlying the solution described herein is to detect and isolate areas of highly dynamic activity, which would be unmanageable using standard Anomaly Detection techniques, as they would generate too many False-Positives.

In order to better understand this concept, one simply needs to consider that a conventional Intrusion Detection System performing Anomaly Detection on the traffic directed to an Internet accessible web server may collect all the IP addresses of clients seen during the training phase. Quite obviously, such a system will not be able to observe all the possible IP addresses that will be connecting during the rest of the web server activity (unless an infinite time of learning is assumed). Hence, at the end of the training phase, the system would still produce an alert for each unobserved client connecting to the server.

A direct solution to this problem lies in the use of policies: the Intrusion Detection System will thus be instructed to the effect that certain resources of the system under surveillance must not be monitored; the limit of this approach is the necessity of writing and managing this information for a large set of different computer systems. The language used to specify the policy should be rich enough to permit fine-grained tuning; however, this opens the door to human misconfiguration errors, and it also consumes the time and skill of a human system administrator.

The solution described herein provides a method for automatically detecting areas of high dynamic activity, where classical Anomaly Detection is avoided. This is obtained through additional time data collected during the learning phase; this data is then processed using statistical analysis and a specific Pattern Discovery technique, which converts the list of dynamic resources into a more compact set of patterns. Patterns are a way to simplify the knowledge base; they also provide a simple mechanism that allows an Anomaly Detection system to match a potentially infinite number of events using only a single element of the knowledge base.

In that respect, one may consider again the example where an Anomaly Detection Intrusion Detection System monitors an Internet accessible web server; by using the solution described herein, the Intrusion Detection System is capable, without any human intervention, of recognizing that client connections should be considered dynamic (as the server receives a lot of connections, each with a different address, and each connection has a limited duration). Hence, the Intrusion Detection System will not trigger any alarm when it sees a connection to the web server that does not come from one of the previously observed addresses. At the same time, a Pattern Discovery technique can detect if all the connections are related; for example, if only clients from a specific subnet can access the web server, the network address and the network mask become the pattern which the Intrusion Detection System can use for matching a new connection.
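The subnet case mentioned above can be illustrated with the following sketch, which derives the smallest IPv4 network containing a set of observed client addresses and then tests a new connection against it. This is only a stand-in for the idea of "network address plus mask as a pattern": the text does not state how such a pattern would be derived, and the function name `covering_network` and the sample addresses are assumptions of this example.

```python
import ipaddress


def covering_network(addresses):
    """Return the smallest IPv4 network that contains every address given.

    Illustrative assumption: the common leading bits of the lowest and
    highest observed addresses are taken as the network prefix.
    """
    values = [int(ipaddress.IPv4Address(a)) for a in addresses]
    lowest, highest = min(values), max(values)
    # Bits that differ between the extremes cannot belong to the prefix.
    prefix_len = 32 - (lowest ^ highest).bit_length()
    base = ipaddress.IPv4Address(lowest)
    return ipaddress.IPv4Network(f"{base}/{prefix_len}", strict=False)


# Hypothetical addresses observed during the learning phase.
observed = ["192.168.10.4", "192.168.10.77", "192.168.10.200"]
network = covering_network(observed)

# A never-seen client from the same subnet matches; an outsider does not.
seen_before = ipaddress.IPv4Address("192.168.10.150") in network
outsider = ipaddress.IPv4Address("10.0.0.1") in network
```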

IP addresses, file names and any other resource attribute that may be significant for the detection process can be described in terms of character strings with some specific formats (for example, IP addresses all have the format xxx.xxx.xxx.xxx, while filenames on Unix systems all have the format /path1/path2/path3/ . . . /file). By using a pattern structure and a specific pattern identification technique, it is possible to substitute the set of IP addresses, filenames and any other resource attribute that has been observed during the learning phase with a single pattern.

BRIEF DESCRIPTION OF THE ANNEXED DRAWINGS

The invention will now be described, by way of example only, with reference to the enclosed figures of drawing, wherein:

FIG. 1 is a block diagram of operating phases in the exemplary arrangement described herein;

FIG. 2 shows an example of a method of recording additional information for a resource within the framework of the exemplary arrangement described herein;

FIG. 3 shows an example of distinction between stable and dynamic resources;

FIG. 4 shows an example of discovered patterns;

FIG. 5 shows an example of a refining process within the framework of the exemplary arrangement described herein; and

FIG. 6 is a schematic state diagram of an Intrusion Detection System as described herein.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS OF THE INVENTION

The basic platform of the exemplary arrangement described herein is an Anomaly Based Intrusion Detection System. This can be of any known type as discussed in the introductory portion of this description, thus making it unnecessary to provide further detail herein.

During a first stage called the learning phase, such an Anomaly Based IDS records a certain amount of data about the system monitored (e.g. a host computer) such as the processes running on the system, the credentials under which they run, and the system resources they use. In doing so, the IDS creates a knowledge base that identifies the usual behavior of the system.

During a subsequent stage called the analysis phase, the data in the knowledge base is matched against the real-time state of the system and any difference between them (e.g. the system performs an action not present in the knowledge base, such as launching a new process, opening a new file, etc.) is considered an anomaly, hence an alert is issued.
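The two phases of such a conventional arrangement can be sketched as follows. This is a deliberately minimal illustration using exact matching only; the class name `ExactMatchDetector` and the representation of an event as a (process, resource) pair are assumptions of this example and are not taken from any specific known system.

```python
class ExactMatchDetector:
    """Toy anomaly detector: learn a set of events, then flag unseen ones."""

    def __init__(self):
        # Knowledge base: every (process, resource) pair seen while learning.
        self.knowledge_base = set()

    def learn(self, process, resource):
        """Learning phase: record an observed event."""
        self.knowledge_base.add((process, resource))

    def is_anomaly(self, process, resource):
        """Analysis phase: any event absent from the knowledge base is an anomaly."""
        return (process, resource) not in self.knowledge_base


detector = ExactMatchDetector()
detector.learn("httpd", "/etc/httpd.conf")

# A temporary session file never seen during learning raises an alert,
# even though writing it may be perfectly legitimate.
alert = detector.is_anomaly("httpd", "/tmp/sess_123")
```

With exact matching, every legitimate but previously unobserved resource produces an alert, which is the False-Positive issue discussed in the following paragraph.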

As already explained in the introductory portion of the description, a basic issue arising in such an arrangement is that during normal behavior the system under surveillance (hereinafter a host computer will be primarily referred to for the sake of simplicity) always performs some actions not recorded during the learning phase. This can be for example the case of an application opening client sockets, or a server writing temporary session files, or a legitimate user uploading a new file in its own directory. All these actions are generally considered as anomalies, and the alerts issued turn out to be False-Positive alarms.

The solution described herein tackles such an issue by automatically “tuning” the knowledge base after the learning phase and changing it into a more compact and semantically enriched form for the subsequent analysis phase as shown in FIG. 1.

This optimization of the knowledge base involves the following steps:

-   record additional temporal data during the learning phase;
-   identify the dynamic part of the knowledge base;
-   discover patterns that cover the dynamic part of the knowledge base;
-   use a regular expression match during the analysis phase (instead of an exact match).

In particular, with reference to FIG. 1, it is possible to identify different steps.

Step (a)—During the learning phase of an Anomaly Detection system, designated 101 in FIG. 1, several items of information about the monitored host are recorded. These are for example: the processes running on the system, the credentials under which they run and the system resources they use. In the arrangement described herein, additional temporal information is recorded, such as when a process starts and ends, when it requests and releases a certain resource, how many times the resource is used during the learning period, and so on. So, in general, for every kind of monitored resource of the system (files, sockets, devices, registry, processes, users . . . ) the method records exactly when each resource is used, meaning that the method tracks with a timestamp each instant when a resource is requested or released. From these data, other temporal parameters can be calculated, such as the total number of uses and the total amount of time in use, as explained hereafter for the case of files.

In a first possible embodiment of the arrangement described herein the file-system of a host computer/system is monitored.

In a step 102 the Anomaly Detection system records what files are opened and closed by which processes (resource list), and hence stores in the knowledge base the filename along with other attributes (temporal information) that can be considered useful (e.g. file size, file owner, file permissions, etc.).

In a presently preferred arrangement, for each file these additional temporal information items are recorded:

-   a timestamp (TS) indicating when the file was opened for the first time;
-   a timestamp (TE) indicating when the file was closed for the last time;
-   an integer (N) indicating how many times the file was opened/closed between these two instants;
-   an integer (Δ) indicating the number of milliseconds during which the file was in open state.

Additionally, there are two global timestamps (TLS and TLE) indicating the beginning and the end of the learning phase.

FIG. 2 shows a graphical example of how all these values can be recorded for a resource named R.
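The recording of TS, TE, N and Δ for one resource can be sketched as follows. The class name `TemporalRecord`, the method names and the use of integer millisecond timestamps are assumptions of this example; for simplicity the sketch also assumes that opens and closes of the same resource do not overlap.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TemporalRecord:
    """Temporal data kept for one monitored resource (illustrative only)."""

    ts: Optional[int] = None          # TS: first time the resource was opened (ms)
    te: Optional[int] = None          # TE: last time the resource was closed (ms)
    n: int = 0                        # N: number of open/close cycles
    delta: int = 0                    # Δ: total milliseconds spent in open state
    _opened_at: Optional[int] = None  # bookkeeping for the current open, if any

    def on_open(self, now_ms: int) -> None:
        if self.ts is None:
            self.ts = now_ms
        self._opened_at = now_ms
        self.n += 1

    def on_close(self, now_ms: int) -> None:
        if self._opened_at is not None:
            self.delta += now_ms - self._opened_at
            self._opened_at = None
        self.te = now_ms


# Resource R is opened twice during the learning phase.
r = TemporalRecord()
r.on_open(1000)
r.on_close(1500)
r.on_open(4000)
r.on_close(4200)
# r now holds TS=1000, TE=4200, N=2, Δ=700
```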

This kind of temporal data is exactly the same for sockets, since they are opened and closed in the same way as files. For process monitoring, TS is the timestamp when the first instance of a process starts and TE is the timestamp when the last instance exits; analogous values are recorded for every other kind of system resource.

Step (b)—With the additional data recorded in Step (a) it is possible to predict part of the future behavior of the system. In order to do so, in a step 103, the whole knowledge base is divided into two parts, designated a stable part and a dynamic part, respectively.

The stable part, which is the resource list indicated 104 in FIG. 1, is the one with the resources whose temporal data indicate they were intensively used during the learning period. The dynamic part, which is the resource list indicated 105 in FIG. 1, on the other hand, is represented by the set of resources that were seldom used during the same period. Of course the distinction between intense and rare use of a resource can be made with several possible comparison expressions. By using certain parameters as thresholds, it is possible to tune the expressions in order to best fit the situation of the monitored system. For instance, by defining T1 as a percentage threshold that indicates what is considered to be rare in comparison to the whole period of learning (e.g. 0.1%), and T2 as an integer threshold that indicates a minimum number of uses of a resource (e.g. 5), some possible expressions to define a temporary (i.e. dynamic) resource “i” can be:

(TE_i − TS_i) < (TLE − TLS) * T1

(Δ_i < (TLE − TLS) * T1) AND (N_i < T2)
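The two comparison expressions can be written out as in the following sketch. The function names are assumptions of this example, all times are taken to be in milliseconds, and T1 is passed as a fraction (0.001 for 0.1%).

```python
def is_dynamic_by_span(ts, te, tls, tle, t1):
    """First expression: (TE_i - TS_i) < (TLE - TLS) * T1."""
    return (te - ts) < (tle - tls) * t1


def is_dynamic_by_usage(delta, n, tls, tle, t1, t2):
    """Second expression: (Δ_i < (TLE - TLS) * T1) AND (N_i < T2)."""
    return delta < (tle - tls) * t1 and n < t2


# Learning phase of 1,000,000 ms with T1 = 0.1%: the cut-off is 1,000 ms.
TLS, TLE, T1, T2 = 0, 1_000_000, 0.001, 5

# A file used only between t=1000 and t=1500 is classified as dynamic ...
short_lived = is_dynamic_by_span(1000, 1500, TLS, TLE, T1)
# ... whereas one used across most of the learning phase is stable.
long_lived = is_dynamic_by_span(0, 900_000, TLS, TLE, T1)
```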

Depending on the monitored resource considered, and the kind of system the Host-Based Intrusion Detection System runs on, other possible comparison expressions can be used.

Those of skill in the art will promptly appreciate that the expressions given in the foregoing are just examples of a notionally boundless number of ways of achieving the same basic result of partitioning the whole knowledge base into a stable part and a dynamic part.

In the exemplary embodiment described herein, by referring to filesystem monitoring, a division is obtained between regular files, such as configuration files, libraries, executables, documents, etc. and temporary files such as cookies, session files, process info, etc. This division however is not made with static rules or policy, by looking at file type or at their location in the system, but only with the extra temporal data recorded for each file. Several possible filters can be used, depending on the level of accuracy and customization desired. For filesystem monitoring, using the first expression above may suffice. This means that a file is considered to be temporary if it is used for a very short period over the whole learning phase, even if during that period it is used several times.

FIG. 3 shows an example of distinction between stable and dynamic resources. Reference number 300 designates the learning phase as a whole, and reference numbers 310 and 330 designate two stable resources, while reference number 320 designates a dynamic resource.

If applied to sockets, this step automatically creates a distinction between server sockets (always listening for connections) and client sockets (opened only for short times and usually each time on different ports).

Step (c)—When the set of dynamic data, 105 in FIG. 1, is defined, the aim of a step 106 is to try to extract from it a set of patterns in order to reduce the knowledge base size and to identify areas that can be considered dynamic, rather than keeping a simple data list. For resources recorded as strings (filenames, pathnames, device names, etc.) this will result in finding one or more regular patterns that cover the whole dynamic set.

By “regular pattern” those patterns are meant having only one variable symbol that can appear only once: if the strings are composed of characters from the alphabet Σ, regular patterns are nonempty strings from the same alphabet Σ that must also contain one special character * ∉ Σ, called jolly. A pattern is considered to cover a set of strings if all strings in the set can be obtained from the pattern by substituting the jolly with an appropriate sequence of characters, e.g. the strings “home”, “house”, “horse” are covered by the pattern “ho*e”.
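The notion of coverage, and the regular expression match used during the analysis phase, can be illustrated by translating a regular pattern into a regular expression. In this sketch the names `pattern_to_regex` and `covers` are assumptions of the example, as is the choice that the jolly must stand for at least one character; the text above does not state whether an empty substitution is allowed.

```python
import re

JOLLY = "*"


def pattern_to_regex(pattern):
    """Translate a regular pattern (exactly one jolly) into a compiled regex."""
    if pattern.count(JOLLY) != 1:
        raise ValueError("a regular pattern contains exactly one jolly")
    prefix, suffix = pattern.split(JOLLY)
    # Literal parts are escaped so that characters such as '.' match themselves.
    # Assumption: the jolly is replaced by a nonempty sequence of characters.
    return re.compile(re.escape(prefix) + ".+" + re.escape(suffix), re.DOTALL)


def covers(pattern, strings):
    """True if every string can be obtained from the pattern."""
    regex = pattern_to_regex(pattern)
    return all(regex.fullmatch(s) is not None for s in strings)


covered = covers("ho*e", ["home", "house", "horse"])    # True
not_covered = covers("ho*e", ["home", "hat"])           # False
```

During the analysis phase, a single compiled pattern of this kind can then be matched against the name of a resource that was never observed during learning.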

More detailed information about this way of handling patterns can be derived from documents such as: D. Angluin “Finding patterns common to a set of strings”, Journal of Computer and System Sciences 21, pp. 46-62 (1980), and A. Brazma, E. Ukkonen, J. Vilo “Finding a good collection of patterns covering a set of sequences”, Report C-1995-60, Department of Computer Science, University of Helsinki, December 1995.

To obtain significant patterns, some semantic issues are taken into account. By “significant patterns” those patterns are meant that should identify the dynamic areas of the knowledge base, without covering the stable part. For that purpose, a Pattern Discovery process is applied to the correct set of data (i.e. by dividing the whole set of dynamic data into sub-sets) and with correctly tuned thresholds. The way to divide the set of dynamic data depends on the semantics of the data themselves and on how they are recorded in the knowledge base: for files, this will occur based on the filesystem structure (with pathnames like “/<dir>/<sub-dir>/ . . . /<filename>” for UNIX-like systems or “<disk>:\<dir>\<sub-dir>\ . . . \<filename>” for Microsoft systems), for socket connections this will occur based on a structure like “<protocol>:<port>:<local address>:<remote address>”, for users, the basic factor can be the string “<domain>\<username>” and so on for each resource of the system. This means that whatever the resources to monitor, they will be recorded with a certain semantics in the knowledge base, and the same semantics must be followed to divide the dynamic set in the optimization process (files divided into directories, sockets divided into protocol:port groups, users divided into domains, and so on).

Referring to the exemplary arrangement described herein, FIG. 4 shows how patterns may be derived from a list of complete pathnames designated 401. First of all the list is divided into subsets corresponding to the directory structure of the filesystem, designated 402. Then the method is applied only to filenames of the same directory. This is done in order to avoid discovering patterns such as “/*” (for UNIX-like operating systems) or “C:\*” (for Microsoft operating systems) that would cover the whole filesystem.
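The division of a pathname list into per-directory subsets can be sketched as follows for UNIX-like pathnames. The function name `split_by_directory` and the sample paths are assumptions of this example.

```python
import posixpath
from collections import defaultdict


def split_by_directory(pathnames):
    """Group filenames by their containing directory (UNIX-like paths)."""
    groups = defaultdict(list)
    for path in pathnames:
        directory, filename = posixpath.split(path)
        groups[directory].append(filename)
    return dict(groups)


dynamic_paths = [
    "/var/log/messages.1.gz",
    "/var/log/messages.2.gz",
    "/tmp/sess_a",
]
subsets = split_by_directory(dynamic_paths)
# Pattern discovery is then applied to each subset separately, so that no
# pattern can span directories and end up covering the whole filesystem.
```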

Another issue to be considered is the use of thresholds in order to extract a “good” set of patterns.

While an in-depth description of what can be considered a “good” set can be found in prior art documents such as the two last cited in the foregoing, the arrangement described herein tries to avoid, insofar as possible, discovering patterns such as “*”, which covers the whole set, or finding as many patterns as the cardinality of the set itself (which is of course useless). By defining a percentage threshold T3, the solution tries to obtain a set of the longest possible patterns, but with a maximum cardinality of T3 percent of the cardinality of the set of strings to be covered.

FIG. 4 shows an example where the directory /var/log of a Linux system is considered. This directory contains some files (e.g. messages, security.log, syslog, urpmi.log, user.log) that can be considered stable, and are designated 403, since they are log files that are regularly incremented with log entries by several applications. However, when they grow too much, a rotation method stores them in compressed files in the same directory (e.g. messages.1.gz, messages.2.gz, security.log.1.gz, etc.). These compressed files can be considered dynamic, and are designated 402, since they are opened/written/closed only once. In the scenario described herein, a standard Anomaly Detection System would report an anomaly every time such a compression takes place, since the name of the compressed file will always be new (with an incrementing number in the name). By contrast, with the proposed Pattern Discovery method all these future files are covered. FIG. 4 also shows the importance of the correct setting of the threshold T3, since the resulting set of patterns, designated 404, 405, and 406, depends on its value; in this case, 25% is the value that yields the best pattern set.

As schematically shown in FIG. 5, the method of Pattern Discovery can be described with these steps:

1—set a threshold (e.g. T3=20%) and calculate from the File Set 500 the maximum number of patterns, which is equal to 4 in the example shown;

2—start with the pattern set 501 with only one pattern of zero characters (only the jolly “*”);

3—use one more character and check how many patterns are needed to cover the whole set (step 502);

4—if they are fewer than the fixed threshold, the pattern set is updated and the previous step 3 is repeated; otherwise, stop.
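The steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of FIG. 5 itself: it lengthens all prefixes together rather than refining each pattern separately, it rounds the maximum number of patterns down, and it treats that maximum as an inclusive limit. The function name `discover_prefix_patterns` is hypothetical.

```python
def discover_prefix_patterns(strings, t3_percent):
    """Return the longest uniform-length prefix patterns that stay
    within the T3 budget (simplified sketch of steps 1 to 4)."""
    names = sorted(set(strings))
    # Step 1: maximum number of patterns allowed by the threshold T3
    # (rounding down and the floor of one pattern are assumptions).
    max_patterns = max(1, (len(names) * t3_percent) // 100)
    # Step 2: start with a single pattern made of the jolly only.
    patterns = ["*"]
    longest = max((len(n) for n in names), default=0)
    for length in range(1, longest + 1):
        # Step 3: use one more character and count the patterns needed.
        candidate = sorted({n[:length] + "*" for n in names})
        # Step 4: stop as soon as the budget would be exceeded.
        if len(candidate) > max_patterns:
            break
        patterns = candidate
    return patterns


rotated = [f"messages.{i}.gz" for i in range(1, 5)]
rotated += [f"syslog.{i}.gz" for i in range(1, 5)]
print(discover_prefix_patterns(rotated, 25))  # ['message*', 'syslog.*']
```

With eight names and T3=25%, at most two patterns are allowed, so the sketch stops at seven-character prefixes. Because every prefix is lengthened in step, "message*" cannot grow to "messages.*"; refining each pattern on its own, as the blocks of FIG. 5 suggest, would avoid that limitation.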

The blocks designated 503 to 507 are exemplary of the possible repetition of these steps with increased numbers of characters in the pattern set.

The transition from step 503 to step 504 is exemplary of the possibility that the elements in the set are in excess of the maximum number of patterns (4 in this case), whereby e.g. the pattern “u*” cannot be refined.

It is then possible to refine patterns by applying the same method on each pattern backward from the end of filenames. FIG. 5 shows an example of this refining process.
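A backward refinement of this kind may be sketched, under simplifying assumptions, as the search for the longest ending shared by all strings that a prefix pattern covers. The sketch assumes that the pattern ends with the jolly and that a single ending per pattern is sought; the function name `refine_with_suffix` is hypothetical.

```python
import os.path


def refine_with_suffix(pattern, strings):
    """Append to a prefix pattern the ending common to all covered strings."""
    prefix = pattern[:-1]  # drop the trailing jolly
    tails = [s[len(prefix):] for s in strings if s.startswith(prefix)]
    if not tails:
        return pattern
    # Longest common suffix, computed as the common prefix of reversed tails.
    common = os.path.commonprefix([t[::-1] for t in tails])[::-1]
    return prefix + "*" + common


print(refine_with_suffix("messages.*", ["messages.1.gz", "messages.2.gz"]))
# messages.*.gz
```

The refined pattern still contains a single jolly, so it remains a regular pattern in the sense used herein.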

The method can be applied to sockets recorded as strings in the form of “<protocol>:<port>:<local address>:<remote address>”. Several server applications are running on the system monitored; many of the sockets used by each server will be considered dynamic, since the remote address will very often be different and the resource is used only for a short period of time (assuming that many client connections are received from many different hosts, as can be the case for a web server or a public ftp server).

As a first step, the dynamic socket set will be divided into sub-sets according to the “<protocol>:<port>”; e.g. “HTTP:80” for a web server, or “FTP:21” for an ftp server, and so on. Then, from each sub-set, it will be possible to try to discover patterns in the remote address part of the socket name (the local address should always be the one of the monitored host). The method will therefore extract patterns such as “HTTP:80:localhost:*” in the case of a public web server, but it could also find more specific patterns such as “HTTP:80:localhost:10.10.*” if the web server is used for a company intranet and is hence accessed only from a class of IP addresses.
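As a rough illustration of the two steps just described, the following sketch groups socket strings by “<protocol>:<port>” and derives one pattern per group from the common leading part of the remaining “<local address>:<remote address>” text. Emitting a single pattern per group is a simplification, the addresses are invented for the example, and the function name `socket_patterns` is hypothetical.

```python
import os.path
from collections import defaultdict


def socket_patterns(sockets):
    """Derive one prefix pattern per protocol:port group (simplified)."""
    groups = defaultdict(list)
    for name in sockets:
        protocol, port, rest = name.split(":", 2)
        groups[f"{protocol}:{port}"].append(rest)
    patterns = []
    for key in sorted(groups):
        # Common leading part of "<local address>:<remote address>".
        common = os.path.commonprefix(groups[key])
        patterns.append(f"{key}:{common}*")
    return patterns


print(socket_patterns([
    "HTTP:80:localhost:10.10.1.5",
    "HTTP:80:localhost:10.10.2.7",
    "FTP:21:localhost:192.0.2.1",
    "FTP:21:localhost:203.0.113.9",
]))
# ['FTP:21:localhost:*', 'HTTP:80:localhost:10.10.*']
```

The intranet-like group keeps the shared address prefix, while the group reached from unrelated hosts collapses to the jolly after the local address.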

After this process is executed on all the monitored resources, the knowledge base used in the analysis phase, designated 107 in FIG. 1, will be the stable knowledge base plus the dynamic knowledge base with patterns. Globally, therefore, the knowledge base is more compact (the temporary file list is reduced by a factor T3) but it is richer in semantics, since it has discovered that the normal behavior of the system is not only the one recorded in the learning phase, but can be extended to a random behavior in the dynamic parts of the system. Hence, normal behavior is equal to the stable part (exact match with recorded values) plus the dynamic part (patterns that describe dynamic zones of the system).

The whole stable part is maintained even if some resources turn out to be covered by patterns, because for each stable resource other properties (such as size, owner, permissions, etc.) are also maintained and can be used in analysis to reveal anomalies, whereas resources covered by patterns are considered to be dynamic and hence no further detection is applied to them.

Step (d)—The analysis phase of an Anomaly Detection system typically just matches the instant behavior of the system with the one recorded in the learning phase. Doing so, every process, file or other resource used by the system which is not recorded in the knowledge base is considered to be an anomaly. By using a regular-expression-like matching instead of an exact match, however, the resource used can be covered by a pattern instead of being exactly present as it is. In fact, “regular patterns” are very simple regular expressions, with only one jolly character.

Consequently, during the analysis phase, designated 108 in FIG. 1, every resource used by the system is first searched in the stable part of the knowledge base and, if it is found, an in-depth analysis can be performed by checking other monitored properties of the resource (size, owner, permissions, etc.). On the other hand, if the resource is not found in the stable part of the knowledge base, instead of issuing an alert, it can be searched with a regular-expression-like match in the set of patterns created in Step (c). If it is covered by a pattern, it can be assumed that it is a temporary resource, or a minor anomaly that in any case must not be signaled because it would represent a False-Positive alarm. Of course, if the patterns do not cover the resource either, an alert should be emitted.
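The lookup order just described may be sketched as follows. The sketch assumes that the stable part is held as a mapping from resource name to recorded properties, that every pattern contains exactly one jolly, and that a property mismatch on a stable resource is reported as an anomaly; the names `covers` and `analyze` and the returned labels are hypothetical.

```python
import re


def covers(pattern, name):
    """Regular-expression match for a pattern with a single jolly '*'."""
    prefix, _, suffix = pattern.partition("*")
    regex = re.escape(prefix) + ".*" + re.escape(suffix)
    return re.fullmatch(regex, name, flags=re.DOTALL) is not None


def analyze(name, properties, stable, patterns):
    """Classify one observed resource against the knowledge base."""
    if name in stable:
        # In-depth analysis: compare the other monitored properties.
        return "stable" if stable[name] == properties else "alert"
    if any(covers(p, name) for p in patterns):
        # Covered by a pattern: treated as a temporary resource, no alert.
        return "dynamic"
    return "alert"


stable = {"/var/log/messages": {"owner": "root"}}
patterns = ["/var/log/messages.*.gz"]
print(analyze("/var/log/messages.7.gz", {}, stable, patterns))  # dynamic
print(analyze("/tmp/unknown", {}, stable, patterns))  # alert
```

The exact match is attempted first, so a stable resource is always checked against its recorded properties even when a pattern would also cover its name.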

Quite obviously, the exemplary embodiment just described is only a first possible implementation of the solution described herein. Major or minor changes and improvements can be made in future alternative embodiments.

Specifically, all the resources (i.e. also processes, devices, registry, etc., and not only files or sockets) used by the system under surveillance can be monitored in the way described herein.

FIG. 6 is a schematic state diagram of an Intrusion Detection System configured for operating along the lines described in detail in the foregoing.

From an initial state, designated 601 in FIG. 6, where the IDS is not active, the learning stage is started. In the learning state, designated 602 in FIG. 6, the IDS records 607 all events occurring in the system and creates its Knowledge Bases (KBs). When the learning process is stopped, the IDS is in a wait state, designated 603 in FIG. 6, ready to optimize its KBs. The optimization process, designated 604 in FIG. 6, is the subject of this disclosure. With the optimized KBs, the IDS is in another wait state, designated 605 in FIG. 6, ready to be used in analyze mode. In the analyze state, designated 606 in FIG. 6, the IDS performs the job 608 for which it was actually designed: it monitors all the events occurring in the system and emits alarms in case of malicious behaviour.
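The sequence of states just described can be summarized in a small transition table. The sketch below models only the forward path named in the text, using the reference numbers of FIG. 6 as enumeration values; any further transitions the figure may contain are not modeled, and the identifiers `State`, `NEXT` and `advance` are hypothetical.

```python
from enum import Enum


class State(Enum):
    INACTIVE = 601
    LEARNING = 602
    READY_TO_OPTIMIZE = 603
    OPTIMIZING = 604
    READY_TO_ANALYZE = 605
    ANALYZING = 606


# Forward transitions only, in the order given in the description.
NEXT = {
    State.INACTIVE: State.LEARNING,
    State.LEARNING: State.READY_TO_OPTIMIZE,
    State.READY_TO_OPTIMIZE: State.OPTIMIZING,
    State.OPTIMIZING: State.READY_TO_ANALYZE,
    State.READY_TO_ANALYZE: State.ANALYZING,
}


def advance(state):
    """Return the state that follows `state` on the forward path."""
    if state not in NEXT:
        raise ValueError(f"no forward transition from {state.name}")
    return NEXT[state]
```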

Further possible developments of the basic arrangement described in detail in the foregoing include, i.a.:

-   improved pattern syntax, evolving from regular patterns to regular expressions, which are much richer in semantics and flexibility, as also shown in FIG. 4 with reference number 407;
-   adopting the same approach as previously described for resources not recorded as strings or, more generally, as sequences from which some patterns can be easily discovered. This is the case of binary data such as network packet payloads, or whole binary files (as is done, for example, for integrity checks and data recovery).

Consequently, without prejudice to the underlying principles of the invention, the details and the embodiments may vary, also appreciably, with reference to what has been described by way of example only, without departing from the scope of the invention as defined by the annexed claims.

1. A computer implemented method of detecting intrusions in a system under surveillance, wherein said intrusions are detected, using a computer, by analyzing events occurring during operation of said system under surveillance by matching said events against a knowledge base comprising information on events which occurred during a learning phase, the method comprising: recording, using a computer, during said learning phase, temporal data that includes a timestamp corresponding to when a specific system resource was used relating to said events which occurred during said learning phase, said learning phase occurring at a different time from an analysis phase; identifying, using a computer, a dynamic part of said knowledge base as a function of said temporal data, wherein said dynamic part includes a list of system resources whose temporal data indicate that use of said system resources is below a predetermined threshold level of use during said learning phase; discovering, using a computer, patterns that cover said dynamic part of the knowledge base; and performing, using a computer, during said analysis phase, a regular expression match at least with respect to said dynamic part of the knowledge base.
 2. The method of claim 1, wherein said step of analyzing comprises matching against said knowledge base, information related to at least one of the entities selected from: the processes running on said system under surveillance, the credentials under which these processes are run and the system resources these processes use.
 3. The method of claim 1, wherein said step of recording comprises recording during said learning phase, temporal data selected from: when a process run on said system under surveillance starts and/or ends, when said process requests and/or releases a certain resource, and how many times the resource is used during said learning phase.
 4. The method of claim 3, wherein said step of recording comprises tracking with a timestamp during said learning phase each instant when any resource used by said system under surveillance is requested or released.
 5. The method of claim 3, comprising the steps of: recording, using a computer, during said learning phase, what files are opened and/or closed by what processes on said system under surveillance; and storing, using a computer, in said knowledge base, along with said temporal data, at least one of the associated file names, the file size, the file owner and the file permissions.
 6. The method of claim 1, wherein said step of recording comprises recording during said learning phase temporal data related to said system under surveillance selected from: a first timestamp indicating when a given file was opened for the first time; a second timestamp indicating when said given file was closed for the last time; an integer indicating how many times said given file was opened and/or closed between the instants identified by said first and second timestamps; and an integer indicating the amount of time a given file was in the open state.
 7. The method of claim 1, wherein said step of recording comprises recording during said learning phase two global timestamps indicating the beginning and the end of said learning phase.
 8. The method of claim 1, wherein said step of recording comprises recording during said learning phase temporal data related to said system under surveillance selected from: instants when sockets are created and/or destroyed; a timestamp when the first instance of a given process run on said system under surveillance starts; and a further timestamp when the last instance of said given process exits.
 9. The method of claim 1, comprising the step of identifying, using a computer, within said knowledge base, a stable part comprising a list of resources of said system under surveillance for which said temporal data indicate a level of use during said learning phase above at least one respective threshold of use.
 10. The method of claim 1, comprising the steps of: defining, using a computer, a first threshold that indicates what is considered to be a rare event in comparison with a whole duration of said learning phase, and a second threshold that indicates a minimum number of occurrences of an event over said learning phase; and identifying, using a computer, said dynamic part of said knowledge base as a function of said first and second thresholds.
 11. The method of claim 1, wherein said step of discovering comprises extracting from said dynamic part of the knowledge base, a set of patterns identifying in said dynamic part of the knowledge base, areas that can be considered dynamic instead of simple data lists.
 12. The method of claim 11, wherein, for resources of said system under surveillance recorded as strings, said step of discovering comprises finding at least one regular pattern that covers the whole dynamic set.
 13. The method of claim 12, comprising the step of selecting, using a computer, said regular patterns as patterns having only one variable symbol that can appear only once.
 14. The method of claim 12, comprising the step of selecting, using a computer, said regular patterns as follows: if the strings are composed of characters from the alphabet, regular patterns are non-empty strings from the same alphabet that also contain one special character called jolly; and considering a pattern as a pattern covering a set of strings if all strings in the set can be obtained from the pattern by substituting the jolly with an appropriate sequence of characters.
 15. The method of claim 11, comprising the steps of: identifying, using a computer, within said knowledge base, a stable part comprising a list of resources of said system under surveillance for which said temporal data indicate a level of use during said learning phase above at least one respective threshold of use; and extracting, using a computer, from said dynamic part of the knowledge base a set of significant patterns that should identify dynamic areas of the knowledge base without covering said stable part.
 16. The method of claim 15, comprising the step of applying, using a computer, a pattern discovery process to at least one set of dynamic data based on the semantic thereof.
 17. The method of claim 16, comprising the step of dividing, using a computer, said set of dynamic data into sub-sets by means of tuned thresholds.
 18. The method of claim 16, comprising the step of applying, using a computer, a pattern discovery process based on at least one of: file system structure or directories for files, the protocol:port groups for sockets, and the domains for users.
 19. The method of claim 14, comprising the step of applying a pattern discovery process, using a computer, to at least one set of dynamic data based on the semantics thereof, wherein said pattern discovery process comprises the steps of: a) setting a given threshold and calculating a maximum number of patterns; b) initializing the pattern set with a single pattern of zero characters, which means to have only the jolly; c) using one more character; d) checking how many patterns are needed to cover the whole set; and e) if these patterns are less than said fixed threshold, updating the pattern set and going back to step c), else stop.
 20. The method of claim 19, comprising the step of refining said patterns, using a computer, by applying the same method on each pattern backward from the end of file names.
 21. The method of claim 16, comprising the step of applying said pattern discovery process, using a computer, to sockets recorded as strings in the form of “<protocol>:<port>:<local address>:<remote address>” by: dividing the dynamic socket set into sub-sets according to the “<protocol>:<port>” semantics; and subjecting each sub-set to a discovery process to discover patterns in the remote address part of the socket name.
 22. The method of claim 1, wherein said analysis phase comprises the step of performing, using a computer, a regular-expression matching, whereby said matching against said knowledge base is performed on the basis of pattern matching instead of exact matching.
 23. The method of claim 22, comprising the steps of: identifying, using a computer, within said knowledge base, a stable part comprising a list of resources of said system under surveillance for which said temporal data indicate a level of use during said learning phase over at least one respective threshold of use; and during said analysis phase, searching, using a computer, for any resource used by said system under surveillance in said stable part of said knowledge base, and i) if said resource is found in said stable part, performing an in-depth analysis by checking other monitored properties of the resource, and ii) if said resource is not found in said stable part, searching said resource with a regular-expression like match in said dynamic part and emitting an alert if said resource is not found to be covered by any of said patterns in said dynamic part of said knowledge base.
 24. An intrusion detection computer system for detecting intrusions in a host computer system under surveillance, wherein said intrusion detection computer system comprises at least one computer and is configured to analyze events occurring during operation of said host computer system under surveillance by matching said events occurring during operation of said host computer system against a knowledge base comprising information on events which occurred during a learning phase, the intrusion detection computer system being configured for performing the method of claim 1.
 25. A network comprising at least one system exposed to intrusions, said network being under surveillance by an intrusion detection computer system according to claim 24.
 26. A non-transitory computer readable medium encoded with a computer program product, loadable into a memory of at least one computer and comprising software code portions for performing the method of claim 1.